Using the RDP Classifier to Predict Taxonomic Novelty and Reduce the Search Space for Finding Novel Organisms
نویسندگان
چکیده
BACKGROUND Currently, the naïve Bayesian classifier provided by the Ribosomal Database Project (RDP) is one of the most widely used tools to classify 16S rRNA sequences, mainly collected from environmental samples. We show that RDP has 97+% assignment accuracy and is fast for 250 bp and longer reads when the read originates from a taxon known to the database. Because most environmental samples will contain organisms from taxa whose 16S rRNA genes have not been previously sequenced, we aim to benchmark how well the RDP classifier and other competing methods can discriminate these novel taxa from known taxa. PRINCIPAL FINDINGS Because each fragment is assigned a score (containing likelihood or confidence information such as the boostrap score in the RDP classifier), we "train" a threshold to discriminate between novel and known organisms and observe its performance on a test set. The threshold that we determine tends to be conservative (low sensitivity but high specificity) for naïve Bayesian methods. Nonetheless, our method performs better with the RDP classifier than the other methods tested, measured by the f-measure and the area-under-the-curve on the receiver operating characteristic of the test set. By constraining the database to well-represented genera, sensitivity improves 3-15%. Finally, we show that the detector is a good predictor to determine novel abundant taxa (especially for finer levels of taxonomy where novelty is more likely to be present). CONCLUSIONS We conclude that selecting a read-length appropriate RDP bootstrap score can significantly reduce the search space for identifying novel genera and higher levels in taxonomy. In addition, having a well-represented database significantly improves performance while having genera that are "highly" similar does not make a significant improvement. On a real dataset from an Amazon Terra Preta soil sample, we show that the detector can predict (or correlates to) whether novel sequences will be assigned to new taxa when the RDP database "doubles" in the future.
منابع مشابه
شناسایی RNA های غیرکدکننده کوتاه عملکردی با استفاده از روش های بیوانفورماتیکی در گوسفند و بز
MicroRNAs (miRNAs) are small non-coding RNAs that have functional roles in post-transcriptional modification. They regulate gene expression by an RNA interfering pathway through cleavage or inhibition of the translation of target mRNA. Numerous miRNAs have been described for their important functions in developmental processes in numerous animals, but there is limited information about sheep an...
متن کاملFUZZY GRAVITATIONAL SEARCH ALGORITHM AN APPROACH FOR DATA MINING
The concept of intelligently controlling the search process of gravitational search algorithm (GSA) is introduced to develop a novel data mining technique. The proposed method is called fuzzy GSA miner (FGSA-miner). At first a fuzzy controller is designed for adaptively controlling the gravitational coefficient and the number of effective objects, as two important parameters which play major ro...
متن کاملEvaluation of the Performances of Ribosomal Database Project (RDP) Classifier for Taxonomic Assignment of 16S rRNA Metabarcoding Sequences Generated from Illumina-Solexa NGS
Here we report a benchmark of the effect of bootstrap cut-off values of the RDP Classifier tool in terms of data retention along the different taxonomic ranks by using Illumina reads. Results provide guidelines for planning sequencing depths and selection of bootstrap cut-off in taxonomic assignments.
متن کاملSYMBIOTIC ORGANISMS SEARCH AND HARMONY SEARCH ALGORITHMS FOR DISCRETE OPTIMIZATION OF STRUCTURES
In this work, a new hybrid Symbiotic Organisms Search (SOS) algorithm introduced to design and optimize spatial and planar structures under structural constraints. The SOS algorithm is inspired by the interactive behavior between organisms to propagate in nature. But one of the disadvantages of the SOS algorithm is that due to its vast search space and a large number of organisms, it may trap i...
متن کاملOPTIMAL DESIGN OF STEEL MOMENT FRAME STRUCTURES USING THE GA-BASED REDUCED SEARCH SPACE (GA-RSS) TECHNIQUE
This paper proposes a GA-based reduced search space technique (GA-RSS) for the optimal design of steel moment frames. It tries to reduce the computation time by focusing the search around the boundaries of the constraints, using a ranking-based constraint handling to enhance the efficiency of the algorithm. This attempt to reduce the search space is due to the fact that in most optimization pro...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره 7 شماره
صفحات -
تاریخ انتشار 2012